2021-09-29

Introduction

Workflow

Each research question draws its own challenges which are unique in themselves. But:

Data ➡️ Cleaning ➡️ Fitness evaluation ➡️ Analysis

R and Mirroreum

Mirroreum offers more computational resources* and a standardized environment

https://docs.biodiversitydata.se/analyse-data/mirroreum/

SBDI4R 📦

- an R 📦 to search an access data

The SBDI4R package enables the R community to directly access data and resources hosted by SBDI.

Please refer to the package documentation for details on how to install it. Once installed the SBDI4R package must be loaded for each new R session:

library(SBDI4R)

Various aspects of the SBDI4R package can be customized.

  • caching

  • e-mail

  • download reason

Caching

SBDI4R can cache most results to local files. This means that if the same code is run multiple times, the second and subsequent iterations will be faster. This will also reduce load on the web servers. By default, this caching is session-based, meaning that the local files are stored in a temporary directory that is automatically deleted when the R session is ended. This behaviour can be altered so that caching is permanent, by setting the caching directory to a non-temporary location. For example, under Windows, use something like:

sbdi_config(cache_directory = file.path("c:","mydata","sbdi_cache")) ## Windows

or for Linux:

sbdi_config(cache_directory = "~/mydata/sbdi_cache") ## Linux

Note that this directory must exist (you need to create it yourself).

Required fields

E-mail address

Each download request to SBDI servers is also accompanied by an “e-mail address” string that identifies the user making the request. You will need to provide an email address registered with the SBDI. You can create an account here. Once an email is registered with the SBDI, it should be stored in the config:

sbdi_config(email="your.registered@emailaddress.com")

Else you can provide this e-mail address as a parameter directly to each call of the function occurrences().

Setting the download reason

SBDI requires that you provide a reason when downloading occurrence data (via the SBDI4R occurrences() function). You can provide this as a parameter directly to each call of occurrences(), or you can set it once per session using:

sbdi_config(download_reason_id = "your_reason_id")

(See sbdi_reasons() for valid download reasons, e.g. * 3 for “education”, * 7 for “ecological research”, * 8 for “systematic research/taxonomy”, * 10 for “testing”)

Privacy

NO other personal identification information is sent. You can see all configuration settings, including the the user-agent string that is being used, with the command:

sbdi_config()

Other options

If you make a request that returns an empty result set (e.g. an un-matched name), by default you will simply get an empty data structure returned to you without any special notification. If you would like to be warned about empty result sets, you can use:

sbdi_config(warn_on_empty=TRUE)

Other packages needed

Some additional packages are needed for these examples. Install them if necessary with the following script:

to_install <- c("colorRamps", "cowplot","dplyr","ggplot2", 
                "leaflet", "maps", "mapdata", "maptools", 
                "remotes","sf", "tidyr", "xts")
to_install <- to_install[!sapply(to_install, 
                                 requireNamespace, 
                                 quietly=TRUE)]
if(length(to_install)>0)
    install.packages(to_install, 
                     repos="http://cran.us.r-project.org")

remotes::install_github("Greensway/BIRDS")

🐟 data from SERS

The data

The Swedish Electrofishing Registry - SERS (Department of Aquatic Resources, SLU Aqua).

2.8 M observations starting in the 1950’s.

fq_str <- pick_filter("resource") 

Follow the instructions. Your choices here would have been ‘in3’ ▶️ ‘dr10’ (data resource 10 = SERS). Your variable fq_str will now contain a string ‘data_resource_uid:dr10’.

Filter

y1 <- 2008
y2 <- 2012
fq_str <- c(fq_str, paste0("year:[", y1, " TO ", y2,"]"))
# Note the square brackets are hard limits

Both filter strings (for data resource and for time period) will be treated as AND factors. For references on how to use the filters see the SBDI APIs documentation.

Occurrences

Using the function occurrences() we can now query for the observations fulfilling our filter. If you haven’t specified your email and the download reason in the sbdi_config() before, you need to pass this here.

xf <- occurrences(fq = fq_str)

# Remove what is not a species
xf$data <- xf$data[xf$data$rank == "species",]

# Simply summarise all records by data source 
table(xf$data$dataResourceName)

Plotting data on a map

You can quickly plot all the observations as a PDF file with the function ocurrence_plot(), one page per species:

Note that the plot is saved to a .pdf file in the current working directory. You can find that with getwd().

occurrences_plot(xf, "output/obsPlot.pdf", 
                 grouped=FALSE, 
                 taxon_level="species", 
                 pch='.')

There are many other ways of producing spatial plots in R. The leaflet package provides a simple method of producing browser-based maps with panning, zooming, and background layers:

library(leaflet)
# drop any records with missing lat/lon values
xfl <- xf$data[!is.na(xf$data$longitude) | !is.na(xf$data$latitude),] 
marker_colour <- rep("#d95f02", nrow(xfl))
# blank map, with imagery background
leaflet() |> 
  addProviderTiles("Esri.WorldImagery") |>
  # add markers
  addCircleMarkers(xfl$longitude, xfl$latitude,  
                   radius = 1, 
                   fillOpacity =.5, 
                   opacity = 1,
                   col=marker_colour,
                   clusterOptions = markerClusterOptions())

Temporal summary

A quick summary over the years reveals a drop in number of records over time.

hist(xf$data$year, breaks = seq(y1, y2), xlab = "Year")

Species summary

In the same way we can summarise the number of observations for each species, by common or scientific name.

sppTab <- table(xf$data$commonName)
sppDF <- as.data.frame(sppTab)
colnames(sppDF)[1] <- "species"
head(sppDF)
##           species Freq
## 1                   61
## 2 Alpine bullhead 4615
## 3 American burbot 7081
## 4        Aral asp    6
## 5     Arctic char   46
## 6    aurora trout  856

sppTab <- table(xf$data$scientificName)
sppDF <- as.data.frame(sppTab)
colnames(sppDF)[1] <- "species"
head(sppDF)
##                                species Freq
## 1       Abramis brama (Linnaeus, 1758)   61
## 2   Alburnus alburnus (Linnaeus, 1758)  660
## 3   Anguilla anguilla (Linnaeus, 1758) 2140
## 4     Astacus astacus (Linnaeus, 1758)  618
## 5 Barbatula barbatula (Linnaeus, 1758)  620
## 6     Blicca bjoerkna (Linnaeus, 1758)   74

Perhaps, you want to send this table as a .CSV file to a colleague. Save the table:

write.csv(sppDF, "output/SERS_species_summary.csv")
# NOTE: again this will be saved on your working directory

Spatial biodiversity analysis

Let’s now ask: How does the species richness vary across Sweden?

library(sf)

# load some shapes over Sweden's political borders
data("swe_wgs84", package="SBDI4R", envir=environment())

# a standard 50 km grid
data("Sweden_Grid_50km_Wgs84", package="SBDI4R", envir=environment())
grid <- Sweden_Grid_50km_Wgs84

obs <- st_as_sf(as.data.frame(xf$data),
                coords = c("longitude","latitude"),
                crs = st_crs(4326))

# overlay the occurrence data with the grid
ObsInGridListID <- st_intersects(grid, obs)
ObsInGridList <- lapply(ObsInGridListID, function(x) st_drop_geometry(obs[x,]))
wNonEmpty <- unname( which( unlist(lapply(ObsInGridList, nrow)) != 0) )

The result ObsInGridList is a list object with a subset of the data for each grid cell. Now summarise occurrences within grid cells:

# apply a summary over the grid cells 
nCells <- length(ObsInGridList)
res <- data.frame("nObs"=as.numeric(rep(NA,nCells)),
                  "nYears"=as.numeric(rep(NA,nCells)),
                  "nSpp"=as.numeric(rep(NA,nCells)),
                  row.names = row.names(grid),
                  stringsAsFactors = FALSE)

cols2use <- c("scientificName", "year")
dataRes <- lapply(ObsInGridList[wNonEmpty], 
                  function(x){
                    x <- x[,cols2use]
                    return(c("nObs" = length(x[,"scientificName"]),
                             "nYears" = length(unique(x[,"year"])),
                             "nSpp" = length(unique(x[,"scientificName"]))
                             )
                           )
                    }
                  )
dataRes <- as.data.frame(dplyr::bind_rows(dataRes, .id = "gridID"))
res[wNonEmpty,] <- dataRes[,-1]
resSf <- st_as_sf(data.frame(res, st_geometry(grid)))

And finally plot the grid summary as a map:

We may now ask whether species richness varies across latitude. So we go further by arranging the observations by latitude:

library(dplyr)
library(tidyr)
xgridded <- xf$data |>
    mutate(longitude = round(longitude * 4)/4, 
           latitude = round(latitude * 4)/4) |>
    group_by(longitude,latitude) |>
    ## subset to vars of interest
    select(longitude, latitude, species) |>
    ## take one row per cell per species (presence)
    distinct() |>
    ## calculate species richness
    mutate(richness=n()) |>
    ## convert to wide format (sites by species)
    mutate(present=1) |>
    do(tidyr::pivot_wider(data=.,  
                          names_from=species, 
                          values_from=present, 
                          values_fill=0)) |>
    ungroup()

## where no species was present, it shows NA. Here we convert these to 0
sppcols <- setdiff(names(xgridded),
                   c("longitude", "latitude", "richness"))
xgridded <- xgridded |> 
  mutate_at(sppcols, function(z) ifelse(is.na(z), 0, z))

And plot it accordingly:

library(ggplot2)

ggplot(xgridded, aes(latitude, richness)) + 
  labs(x = "Latitude (º)", 
       y = "Species richness") +
  lims(y = c(0,20)) +
  geom_point() + 
  theme_bw()

Opportunistic data on 🐲-flies

Name searching

In this example we are interested in exploring opportunistically collected data from the Swedish citizen science species observation portal - Artportalen.

To begin with, we want be sure there is an unequivocal way to find the species within the order Odonata (dragonflies) and nothing else, so let’s search for “odonata”:

sx <- search_fulltext("odonata")
## [1] "https://species.biodiversitydata.se/ws/search.json?q=odonata&fq=idxtype%3ATAXON"
sx$data[, c("guid", "scientificName", "rank", "occurrenceCount")]
##       guid                          scientificName    rank occurrenceCount
## 1  9829523  Odonata associated gemycircularvirus 1 species               0
## 2 10072832  Odonata associated gemycircularvirus 2 species               0
## 3  8062407 Bdellodes odonata Wallace & Mahon, 1976 species               0
## 4      789                                 Odonata   order           14421
## 5  7367071    Ramalina fastigiata var. odonata Hue variety               0

Let’s refine the search. To know which search fields we can use to filter the search we use the function sbdi_fields(fields_type = "general"). The search field we are looking for is “order_s”.

sx <- search_fulltext(fq="order_s:Odonata", page_size = 10)
## [1] "https://species.biodiversitydata.se/ws/search.json?fq=order_s%3AOdonata&fq=idxtype%3ATAXON&pageSize=10"
sx$data[, c("guid", "scientificName", "rank", "occurrenceCount")]
##       guid                             scientificName  rank occurrenceCount
## 1  1429753                Gomphomacromia Brauer, 1864 genus               0
## 2  1426725               Austropetalia Tillyard, 1916 genus               0
## 3  4799335                 Sogjutella Pritykina, 1980 genus               0
## 4  4302686                    Neuragrion Karsch, 1891 genus               0
## 5  4799353              Xamenophlebia Pritykina, 1981 genus               0
## 6  1429769               Lauromacromia Geijskes, 1970 genus               1
## 7  1428195                     Sympetrum Newman, 1833 genus             613
## 8  4798599 Corduliochlora Marinov & Seidenbusch, 2007 genus               0
## 9  1423625             Torrenticnemis Lieftinck, 1949 genus               0
## 10 1423468                  Cyanallagma Kennedy, 1920 genus               0

Now we can download the taxonomic data (note that the search is case-sensitive):

tx <- taxinfo_download("order_s:Odonata", 
                       fields = c("guid", "order_s","genus_s", "specificEpithet_s", 
                                  "scientificName",  "canonicalName_s", "rank"), 
                       verbose = FALSE)
## restrict to species and not hybrids
tx <- tx[tx$rank == "species" & tx$genusS != "",] 

You can save the tx object as the complete species list for later use.

Filters

We start by searching for the data resource we are interested in using the function pick_filter(). This is an interactive query guiding you through the many resources available to filtering your query (data resources, spatial layers, and curated species lists).

# follow the instructions 
fq_str <- pick_filter("resource") 

Follow the instructions. Your choices here would have been “in3” ▶️ “dr5”. Your variable fq_str will now contain a string “data_resource_uid:dr5”.

We only want to look at data from year 2000 to 2010:

y1 <- 2000
y2 <- 2010
fq_str <- c(fq_str, paste0("year:[", y1, " TO ", y2,"]"))
# Note the square brackets are hard limits

Spatial filter

We also want to filter spatially for Southern Sweden (Götaland).

SBDI APIs take as search input polygons in the so-called WKT Well Known Text format.

data("swe", package = "SBDI4R")
wGotaland <- swe$Counties$LnNamn %in% c("Blekinge", "Gotlands", "Hallands", "Jönköpings", "Kalmar",
                                        "Kronobergs", "Östergötlands", "Skåne", "Västra Götalands")
gotaland_c <- swe$Counties[wGotaland,]

Let’s construct the WKT string:

# transform the CRS
gotaland_c <- st_transform(gotaland_c, crs = st_crs(4326))

# disolve the counties into one polygon
gotaland <- st_union(gotaland_c)

# create a convex hull of the polygon
gotaland_ch <- st_convex_hull(gotaland)

# cast it as MULTIPOLYGON as this is what SBDIs API need
gotaland_ch <- st_cast(gotaland_ch, to = "MULTIPOLYGON")

# create WKT string
wkt <- st_as_text(gotaland_ch)

The WKT string then looks like this:

## [1] "MULTIPOLYGON (((13.33575 55.34003, 12.81633 55.38594, 11.25342 58.35786, 11.13161 58.90942, 11.13145 59.01184, 11.21142 59.0897, 11.31566 59.11651, 11.82032 59.23553, 11.94833 59.26237, 12.06197 59.27159, 12.23104 59.27357, 15.79383 59.03876, 15.84306 59.02498, 19.2889 57.99043, 19.3058 57.96888, 18.90037 57.44014, 18.86704 57.39753, 18.3725 57.00678, 18.30044 56.9528, 16.40805 56.20229, 14.19057 55.38557, 13.33575 55.34003)))"

Occurrences

Next, we download the observations using the command occurrences().

xf <- SBDI4R::occurrences(taxon = "order:Odonata", 
                          fq = fq_str, 
                          wkt = wkt, 
                          extra = "collector")

Quality and fit-for-use check

Before we can use the observation records we need to know if the observation effort (sampling effort) has varied over time and in space. we can

We can approximate observation effort from the data by defining field visits i.e. occasions at which an observer has sampled observations using the package BIRDS. Additionally we want the data to be summarized over a grid of 25 km. Please refer to the BIRDS package documentation for more detail.

remotes::install_github("Greensway/BIRDS")
library(BIRDS)

Organise and Summarise

OB <- organiseBirds(xf$data, sppCol = "species" , 
                    # We only want observations identified at the species level
                    taxonRankCol = "rank", taxonRank = "species", 
                    # the visits are defined by collector and named locality
                    idCols = c("locality", "collector"), 
                    timeCols = c("year", "month", "day"), 
                    xyCols =c("longitude","latitude") )

We don’t need the whole grid, just the piece that overlaps our searching polygon

wInt <- unlist(st_intersects(gotaland, Sweden_Grid_25km_Wgs84))
gotaland_grid25 <- Sweden_Grid_25km_Wgs84[wInt,]
SB <- summariseBirds(OB, 
                     grid = gotaland_grid25, 
                     spillOver = "unique")

Once summarised, we can see over space and for a few selected years how the number of observations is distributed

We now want to use the number of field visits as the measure for sampling effort:

vis <- ggplot(data = SB$spatial, aes( fill = nVis)) +
  geom_sf() +
  ggtitle("Number of visits")

spp <- ggplot(data = SB$spatial ,aes( fill = nSpp)) +
  geom_sf() +
  ggtitle("Number of species")

Temporal check

We see that SB contains an element called SB$temporal that contains a daily time series with time-specific rows when there is information. xts also supports day time, but dating below day resolution is not yet implemented in the BIRDS package.

sb.xts <- SB$temporal
head(sb.xts, 5)
##            nObs nVis nSpp
## 2000-03-24    1    1    1
## 2000-04-05    4    3    3
## 2000-04-06   11    6    3
## 2000-04-10    1    1    1
## 2000-04-12    3    3    1

Sub-setting is convenient in xts as you can do it with its dates and with a / for a range of dates.

sb.xts["2010-09-07"] #a specific day
sb.xts["2010-09-01/2010-09-15"] #for a period
sb.xts["2010-09"] #a specific month

The package xts has several tools for converting to different time periods. Here we use apply.monthly to obtain the total number of observations and visits per month. Read more in ?plot.xts.

library(xts)
obs.m <- apply.monthly(sb.xts$nObs, "sum", na.rm = TRUE)
vis.m <- apply.monthly(sb.xts$nVis, "sum", na.rm = TRUE)

plot(obs.m, col = "darkblue", grid.ticks.on = "month", 
     major.ticks = "year", grid.col = "lightgrey",  
     main = "Total number of daily observations and visits per month")
lines(vis.m, col = "orange", lwd=2, on=1)

Species trends

We pick two species and compare their trends in number of visits where the species where reported, relative to the total number of visits.

library(dplyr)
sppCountLq <- obsData(OB) |> 
            group_by(year,visitUID) |> 
            summarise("focalCount" = sum(scientificName == "Libellula quadrimaculata"),
                      "sppLength" = length(unique(scientificName)), 
                      .groups = "drop") |> 
            ungroup() |> 
            group_by(year) |> 
            summarise("focalCount" = sum(focalCount),
                      "nVis" = length(unique(visitUID)), 
                      .groups = NULL)

sppCountLq$relCount <- sppCountLq$focalCount/sppCountLq$nVis

sppCountSd <- obsData(OB) |> 
            group_by(year,visitUID) |> 
            summarise("focalCount" = sum(scientificName == "Sympetrum sanguineum"),
                      "sppLength" = length(unique(scientificName)), 
                      .groups = "drop") |> 
            ungroup() |> 
            group_by(year) |> 
            summarise("focalCount" = sum(focalCount),
                      "nVis" = length(unique(visitUID)), 
                      .groups = NULL)

sppCountSd$relCount <- sppCountSd$focalCount/sppCountSd$nVis

oldpar <- par(no.readonly = TRUE)
par(mar=c(4,4,1,1), las=1)
plot(sppCountLq$year, sppCountLq$relCount, type = "l", 
     lwd = 3, xlab = "Year", ylab = "Relative number of visits with observations", 
     ylim=c(0, max(sppCountLq$relCount)), xaxp=c(2000, 2010, 10))
lines(sppCountSd$year, sppCountSd$relCount, lwd=3, col="#78D2EB")
legend("bottomright", 
       legend=c("Libellula quadrimaculata","Sympetrum sanguineum"), 
       text.font = 3, col = c("black", "#78D2EB"), lwd = 3, bty = "n")

par(oldpar)

Your collaboration is appreciated

Open Source also means that you can contribute. You don’t need to know how to program but every input is appreciated. Did you find something that is not working? Have suggestions for examples or text? you can always

  1. Reach to us via the support center
  2. Submit and issue to the GitHub code repository see how
  3. Or contribute with your code or documents modifications by “forking” the code and submitting a “pull request”

The repositories you can contribute to are: